Ranking XPaths for extracting search result records
Abstract
Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine that combines results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on each new search result page. In this paper we describe an algorithm that automatically determines XPath expressions for extracting SRRs from webpages. From a single search result page, the algorithm determines an XPath expression that can be reused to extract SRRs from other pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that a useful XPath is determined for 85% of the tested search result pages. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
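To illustrate what reusing one XPath expression across pages with the same template could look like, here is a minimal sketch. It does not reproduce the ranking algorithm from the paper. The XPath, the page markup, and the function name `extract_srrs` are all hypothetical, and the example relies on Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset and expects well-formed markup (real search result pages would likely need an HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath for one result template; not taken from the paper.
SRR_XPATH = ".//div[@class='result']"

# Two made-up pages that share the same template.
PAGE_A = """<html><body>
<div class="nav">Home</div>
<div class="result"><a href="http://example.com/1">First hit</a></div>
<div class="result"><a href="http://example.com/2">Second hit</a></div>
</body></html>"""

PAGE_B = """<html><body>
<div class="nav">Home</div>
<div class="result"><a href="http://example.com/3">Another hit</a></div>
</body></html>"""


def extract_srrs(page_source, xpath=SRR_XPATH):
    """Return (title, url) pairs for every node matched by the XPath."""
    root = ET.fromstring(page_source)
    records = []
    for node in root.findall(xpath):
        link = node.find(".//a")  # first link inside the matched record
        if link is not None:
            records.append(((link.text or "").strip(), link.get("href")))
    return records


# The same expression is applied unchanged to both pages.
for page in (PAGE_A, PAGE_B):
    print(extract_srrs(page))
```

The point of the sketch is only that the expression is determined once and then applied unchanged to further pages; how a good expression is found and ranked is the subject of the paper itself.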
Similar resources
Sample-based XPath Ranking for Web Information Extraction
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper targets auto...
An Ensemble Click Model for Web Document Ranking
Annually, web search engine providers spend more and more money on document ranking in search engine result pages (SERPs). Click models provide advantageous information for ranking documents in SERPs by modeling interactions between users and search engines. Here, three modules are employed to create a hybrid click model; the first module is a PGM-based click model, the second module in a d...
A Scalable Image Snippet Extraction Framework for Integration with Search Engines
Search result visualization is a task performed by search engines that enables users to find their desired documents effectively and efficiently. An image-based summary, or the best images of a web document, displayed as part of the visualization process has become indispensable, as humans perceive images instantaneously. But selection of the best image increases latency in search resul...
Accelerating E-Commerce Search Engine Ranking by Contextual Factor Selection
In industrial large-scale search systems, such as the Taobao.com search for commodities, the quality of the ranking result is continually improved by introducing more factors from complex procedures, e.g., deep neural networks for extracting image factors. Meanwhile, the growing number of factors demands more computational resources and raises the system response latency. It has been observed ...
Incremental Web Search: Tracking Changes in the Web
A large amount of new information is posted on the Web every day. Large-scale web search engines often update their index slowly and are unable to present such information in a timely manner. In this thesis, we present our solutions for searching for new information on the web by tracking the changes of web documents. First, we present the algorithms and techniques useful for solving the following...
Publication date: 2012